-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix travis - job timeout and test timeout #203
Conversation
- PYTHONPATH=$(pwd) | ||
matrix: | ||
- OTHER_TESTS=true | ||
- PRODUCT_TEST_GROUP_SA_BARE="tests/product/test_installer.py tests/product/standalone/test_installation.py" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does SA stand for?
EDIT: probably stands for standalone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
standalone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not clear that SA means that, what about YS? Please do not use shortcuts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, maybe you can rename the groups to STANDALONE_BARE_TESTS
and YARN_SLIDER_TESTS
. Including the PRODUCT_TEST_GROUP_*
prefix is a little too verbose and redundant.
ed95db2
to
c11c36d
Compare
- PRODUCT_TEST_GROUP=11 | ||
- PRODUCT_TEST_GROUP=12 | ||
- PRODUCT_TEST_GROUP=13 | ||
- PRODUCT_TEST_GROUP=14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
travis not only fails because of timeouts. Travis has low amount of resources. There are tests which spin up a docker cluster which consists of 4 docker containers where each tries to start presto. In such case travis kills some processes as the memory usage is exceeding travis capacity. I think this is the most serious issue.
Moreover, when a process is killed, then presto-admin waits until timeout to reports that presto-server didn't show up. Thus making tests last longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to fix jobs going over 50 minutes and failing. I wasn't considering Travis memory usage as part of this PR.
Travis didn't pass ; / |
haha yea i noted when i made the PR that travis still wouldn't work with the newest version of presto (which this branch uses). i created an experiment branch that uses 148 and those builds have worked. |
"standalone bare, and yarn-slider presto-admin only images", | ||
action="store_true") | ||
|
||
if len(argv) <= 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is best to re-write the argument parsing to use a positional argument. That way you can more succinctly specify that you require at least one argument and you can specify what the available choices are. Here's what I propose:
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("image_type", metavar="image_type", type=str, nargs="+",
choices=["standalone_presto", "standalone_pa", "standalone_bare",
"yarn_slider_pa", "all"], help="Specify the type of "
"image to create. The available choices are: standalone_presto, "
"standalone_pa, standalone_bare, yarn_slider_pa, all")
image_builder = ProductTestImageBuilder()
args = parser.parse_args()
if "all" in args.image_type:
image_builder.test_setup_standalone_pa_only_images()
image_builder.test_setup_standalone_pa_only_images()
image_builder.test_setup_standalone_bare_images()
image_builder.test_setup_yarn_slider_pa_only_images()
else:
if "standalone_presto" in args.image_type:
image_builder.test_setup_standalone_presto_images()
if "standalone_pa" in args.image_type:
image_builder.test_setup_standalone_pa_only_images()
if "standalone_bare" in args.image_type:
image_builder.test_setup_standalone_bare_images()
if "yarn_slider_pa" in args.image_type:
image_builder.test_setup_yarn_slider_pa_only_images()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure this seems a little cleaner
You should squash pretty much all of your commits. At the moment they're broken down at a level that is too granular. I think you should have two commits: one that refactors travis.yml as well as the refactor of I would also like @ebd2 to take a look at this when he gets back on Monday. |
- PRODUCT_TEST_GROUP_YS_PA="tests/product/yarn_slider/test_slider_installation.py" | ||
#-QUARANTINED="tests/product/yarn_slider/test_server_install.py" | ||
install: | ||
- pip install --upgrade pip==6.1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif [ -v OTHER_TESTS ]; then | ||
make clean lint dist docs test-rpm | ||
nosetests --with-timer --timer-ok 60s --timer-warning 300s -s tests.unit | ||
nosetests --with-timer --timer-ok 60s --timer-warning 300s -s tests.integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need to set timers for unit and integration tests, they are really short.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
20 jobs travis sounds like an overkill, note that you are running |
from time import time | ||
|
||
|
||
def _initialize_module_logger(module_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I suggest we remove this method completely and make this piece of code execute at the module level. In the functions where you refer to the logger
variable make sure you include the line global logger
. Also, I suggest we hardcode the module to "root". I think that's fine because the reason why you would want to have different loggers for every module is to be able to adjust the logging level and logging destination at a more granular level whereas we just print everything at INFO to the console.
I did some experiments and determined that if the code is moved to be module level it will only get executed once.
If a product test case is taking a long time, this decorator can be added to the test and some of its helper function to see where and why the function is taking a long time.
@kokosing good point. I think the tests are grouped as optimally as possible, given how long they take right now. I think the only solution is to rework the tests to run faster (like you said). It also doesn't help that the tests have to be run with Python 2.6, 2.7, which automatically doubles the number of jobs per build. There's no way to work around this is there? @petroav is there a ticket filed somewhere for trying to make the tests fun quicker so we can have less jobs per build or should I do that? I also fixed up the logging in @ebd2 i separated |
@zachyee nah, there's no ticket for speeding up the tests. Feel free to create one but don't work on it. If it takes 20 jobs to bring each run to under 50 minutes, or whatever the timeout is, then so be it. Have you tried with fewer jobs? Things might run faster now with the bigger machines so you should definitely see if you can optimize it. |
|
||
|
||
class ImageBuilder: | ||
def __init__(self, caller): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name caller suggests that it can be something pretty generic. I think we should do the following:
- Let this take any testcase (and rename caller -> testcase)
- Extract the cluster types into tests/product/cluster_types.py and import them here and in base_product_case.
That separates the two dependencies (testcase for assertions and cluster definitions) such that they can be satisfied independently.
Looks good modulo the small refactor and the naming nit. |
@zachyee Please make sure that the problem from prestodb/presto#5689 does not exists in your travis.yml script. Test grouping is ok. Thanks. Using tox and travis to provision different python version is redundant. We should choose one of these, I prefer travis. See #201. You are using |
matrix: | ||
- OTHER_TESTS=true | ||
- SHORT_PRODUCT_TEST_GROUP=0 | ||
- LONG_PRODUCT_TEST_GROUP_STANDALONE_PRESTO_ADMIN="tests/product/test_server_install.py" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd get rid of the word standalone in all the test group names. It's the only way to run presto-admin right now, and makes the name really long.
(Sorry for chiming in so late, I've been using this patch to be able to run tests on travis, and noticed this. Also--works great! thanks!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I combined a couple of job groups to get rid of 4 jobs (from 18 to 14). That's the best I can do until tests are faster or someone reaches out to Travis to increase the 50-minute timeout limit. Did the cluster_type refactor and renaming. @kokosing thanks for pointing that out. I ran into a similar problem when it wasn't erroring out from lint errors. I went ahead and |
except: | ||
# retry once | ||
rpm_name = StandalonePrestoInstaller._download_rpm() | ||
raise OSError(1, 'Presto RPM not detected.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you have a testcase where this gets called in the constructor. Does it make sense to pass it in here and testcase.fail() if the RPM doesn't exist?
It might not because it isn't a real test case, if it's messier that way, I'm fine with it as-is.
LGTM! |
I wouldn't worry about decreasing the number of jobs further. It's better for it not to need so many jobs, but having the whole thing complete in a reasonable amount of time is more important. |
Looks good |
To fix travis test timeout issue
Created more jobs to fix the case where a single job would take more than 50 minutes to complete.
Moved creating the docker images for the product tests to before the test executes rather than during the test itself. This helps tests avoid their 10 minute limit for not outputting anything.
Added some logging that can be used in the future to figure out why tests are taking a long time (if this ever happens).
Works successfully with presto version 148, but not with the newest version. I can hardcode the URL where the tests download the presto rpm to fetch the 148 version rather than the newest version. This might be useful to test the release? (This PR fetches 149, not 148, so it will fail on travis)